Wire authserver DCR resolver and add structured logs#5044
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
4e90171 to
4e41c9b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5044 +/- ##
==========================================
+ Coverage 67.50% 67.56% +0.05%
==========================================
Files 601 601
Lines 61055 61262 +207
==========================================
+ Hits 41218 41391 +173
- Misses 16721 16745 +24
- Partials 3116 3126 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9b66570 to
fc135b4
Compare
4e41c9b to
4ee2863
Compare
fc135b4 to
3478e8c
Compare
4ee2863 to
283489f
Compare
3478e8c to
cc43b14
Compare
283489f to
7ebe96e
Compare
cc43b14 to
5ad7473
Compare
7ebe96e to
cbfb884
Compare
cbfb884 to
063e820
Compare
5ad7473 to
80d5cbe
Compare
Closes the 14 review threads from review #4199898531 that are scoped to this PR; #3 and #5 are intentionally deferred to PR #5044 which is the PR that consumes the ClientSecret plumbing and RedirectURI propagation. Behaviour fixes: - #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now preserve the issuer's path component, so multi-tenant upstreams that ship DCR without advertising it (or this auth server when it itself has a tenant prefix in its issuer) keep their tenant prefix in the synthesised URL and the defaulted redirect URI. - #7 resolveDCRCredentials wraps registration in a package-level singleflight.Group keyed by DCRKey so concurrent callers for the same upstream coalesce into a single RFC 7591 registration. The leader's closure rechecks the cache before any network I/O so followers that arrive just after a Put returns observe the fresh entry. - #9 applyResolution writes ClientID only when the run-config copy leaves it empty, treating it symmetrically with the endpoint writes. This is defence-in-depth for callers that bypass the resolver's validateResolveInputs precondition check. - #10 the cfg.RegistrationEndpoint short-circuit branch in resolveDCREndpoints validates URL well-formedness and HTTPS-or- loopback locally, so a non-HTTPS endpoint fails before performRegistration constructs a bearer-token transport for it. - #13 endpointsFromMetadata validates discovered authorization_endpoint and token_endpoint before populating dcrEndpoints, so a self- consistent metadata document with http:// URLs cannot flow through to the auth-code or token-exchange path. Doc / scope: - #6 trust-assumption note added on DCRUpstreamConfig with a link to the follow-up SSRF-hardening issue (#5135) per the "Document Architectural Constraints" rule. - #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what the in-memory store does and does NOT solve (cross-replica sharing, durability, cross-process write coordination). Test hygiene: - #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that added no meaningful coverage. - #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in TestFetchAuthorizationServerMetadataFromURL, matching the sibling pattern in dcr_test.go. - #12 new failingDCRStore test double + two cases asserting the cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache put") that operators see when a future Redis backend errors. - #14 renamed the TestNeedsDCR case from "client_id with dcr (invalid but returns false)" to "client_id wins over dcr_config (defensive AND semantic)" so the case label describes the intentional defensive behaviour rather than reading like a bug report. - #15 removed a //nolint:lll directive that was suppressing a non- existent violation (the line is 108 chars; the limit is 130). - #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out 16 goroutines doing alternating Put/Get against overlapping and disjoint keys, with a fail-fast deadline, so go test -race actually exercises the sync.RWMutex guard. New tests cover each behaviour change: - TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers pins exactly-one-registration under N concurrent callers. - TestSynthesiseRegistrationEndpoint_PreservesIssuerPath / TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path- preservation fix. - TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers the symmetric ClientID write. - TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers the local registration-endpoint guard. - TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers the metadata authz/token endpoint guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the 14 review threads from review #4199898531 that are scoped to this PR; #3 and #5 are intentionally deferred to PR #5044 which is the PR that consumes the ClientSecret plumbing and RedirectURI propagation. Behaviour fixes: - #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now preserve the issuer's path component, so multi-tenant upstreams that ship DCR without advertising it (or this auth server when it itself has a tenant prefix in its issuer) keep their tenant prefix in the synthesised URL and the defaulted redirect URI. - #7 resolveDCRCredentials wraps registration in a package-level singleflight.Group keyed by DCRKey so concurrent callers for the same upstream coalesce into a single RFC 7591 registration. The leader's closure rechecks the cache before any network I/O so followers that arrive just after a Put returns observe the fresh entry. - #9 applyResolution writes ClientID only when the run-config copy leaves it empty, treating it symmetrically with the endpoint writes. This is defence-in-depth for callers that bypass the resolver's validateResolveInputs precondition check. - #10 the cfg.RegistrationEndpoint short-circuit branch in resolveDCREndpoints validates URL well-formedness and HTTPS-or- loopback locally, so a non-HTTPS endpoint fails before performRegistration constructs a bearer-token transport for it. - #13 endpointsFromMetadata validates discovered authorization_endpoint and token_endpoint before populating dcrEndpoints, so a self- consistent metadata document with http:// URLs cannot flow through to the auth-code or token-exchange path. Doc / scope: - #6 trust-assumption note added on DCRUpstreamConfig with a link to the follow-up SSRF-hardening issue (#5135) per the "Document Architectural Constraints" rule. - #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what the in-memory store does and does NOT solve (cross-replica sharing, durability, cross-process write coordination). Test hygiene: - #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that added no meaningful coverage. - #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in TestFetchAuthorizationServerMetadataFromURL, matching the sibling pattern in dcr_test.go. - #12 new failingDCRStore test double + two cases asserting the cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache put") that operators see when a future Redis backend errors. - #14 renamed the TestNeedsDCR case from "client_id with dcr (invalid but returns false)" to "client_id wins over dcr_config (defensive AND semantic)" so the case label describes the intentional defensive behaviour rather than reading like a bug report. - #15 removed a //nolint:lll directive that was suppressing a non- existent violation (the line is 108 chars; the limit is 130). - #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out 16 goroutines doing alternating Put/Get against overlapping and disjoint keys, with a fail-fast deadline, so go test -race actually exercises the sync.RWMutex guard. New tests cover each behaviour change: - TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers pins exactly-one-registration under N concurrent callers. - TestSynthesiseRegistrationEndpoint_PreservesIssuerPath / TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path- preservation fix. - TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers the symmetric ClientID write. - TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers the local registration-endpoint guard. - TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers the metadata authz/token endpoint guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
isTransientNetworkError previously emitted the cached-DCR-client
remediation Warn from inside its classifier on every permanent 4xx.
A tight Token() loop hitting the same condition would spam the same
record on every call before the workload's Unauthenticated status
propagated. The same branch also fired for non-DCR workloads, which
saw a remediation telling them to "delete cached credentials" they
never had.
Strip the side effect from the classifier and emit the Warn from
markAsUnauthenticated, which already gates the close-monitoring
transition through stopOnce. The Warn now fires at most once per
state transition, is suppressed when no client_id context is
available, and reads honestly about the variability ("if this
workload uses cached DCR or CIMD credentials they may be stale").
Addresses #5044 review finding F5 (MEDIUM).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related polish items in the authserver handlers package: NewHandler did not validate that AuthorizationServerConfig (or its embedded *fosite.Config) was non-nil. issuer() reached into the config at request time, so a misconfigured caller would panic deep inside an HTTP handler instead of failing at startup. Add the nil check to the constructor and simplify issuer() to rely on the constructor invariant. Pin the new invariant with TestNewHandler_ErrorsOnNilConfig. The /oauth/register success log was promoted to Info even though the operation is neither long-running nor exceptional. Demote back to Debug; an audit-log path is the right home for the audit signal if that becomes a requirement. Addresses #5044 review findings F6 and F7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles ten review-feedback items in pkg/authserver/runner; each addresses internal-API hygiene or doc-comment drift in the DCR resolver, no behaviour change for callers. Sanitizer hardening (F1, F19, F18, F23): sanitizeErrorForLog now strips userinfo and fragment in addition to the query, since either can carry credentials or tokens (https://user:pass@host, implicit-flow #access_token=...). queryStrippingPattern matches http/https case-insensitively per RFC 3986 §3.1 so HTTPS://... cannot escape sanitisation. trimURLTrailingPunctuation switches to strings.IndexByte to match the ASCII-only terminator set without the rune-decoding overhead. Test cases added for each. Resolver error API (F4, F13, F20): the dcrStepRegister panic-recovery branch no longer emits a duplicate slog.Error; the captured stack travels with the wrapped *dcrStepError to the boundary log so a single panic produces a single record. DCRStepError / LogDCRStepError lowercased to dcrStepError / logDCRStepError since no caller lives outside the package. logDCRStepError now no-ops on nil err so the unknown-step branch cannot fire on a missing failure. Resolution helpers (F11, F12, F25): applyResolution renamed to consumeResolution to communicate that it is a one-shot state transition (clearing DCRConfig is unconditional), not an idempotent defaulting step. The applyResolutionToOAuth2Config doc now states the paired-call invariant explicitly without referencing a specific test. Lifecycle docs (F21, F22): the per-instance dcrStore vs. process-wide dcrFlight asymmetry is now stated on both sides, and EmbeddedAuthServer.Close documents the future-Close hook for a backend with handles. Inline rules-file rationale (F24): production comments no longer cite .claude/rules/... by path; the principle is inlined. Addresses #5044 review findings F1, F4, F11, F12, F13, F18, F19, F20, F21, F22, F23, F24, F25. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Body-only review items follow-up: F2 (HIGH) — Addressed in 4d504fe. F8 / F9 / F10 / F26-F30 (test-gap cluster) — Deferred. F26 was closed mechanically by the new CIMD case in Generated with Claude Code |
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
jhrozek
left a comment
There was a problem hiding this comment.
A few notes — none blocking.
Three follow-up items from review: isPermanentTokenEndpointError used to treat every non-5xx RetrieveError as permanent, which meant the DCR remediation Warn fired on transient back-pressure (408 Request Timeout, 429 Too Many Requests). Narrow the classifier to an explicit 400 / 401 / 403 allowlist so the "delete the cached credentials and restart" hint fires only on responses where stale cached client credentials are a plausible cause. Retry behaviour is unaffected — the retry decision is gated entirely by isTransientNetworkError, which already returns false for the whole RetrieveError branch. handleHTTPResponse in pkg/oauthproto/dcr.go now drains the response body after the LimitReader-bounded ReadAll. Without this, an upstream returning more than 8 KiB on /register error left the connection mid- stream when the deferred Close fired, preventing TCP connection reuse. Mirrors what the non-JSON content-type branch a few lines down already does. remoteAuthLogContext moved out of pkg/runner into pkg/auth/remote as *Config.LogContext(), collocating it with resolveClientCredentials so the cached-CIMD > cached-DCR > static precedence has one home. Adding a fourth cached field in future updates both call sites in one place. Tests moved to pkg/auth/remote/config_test.go. Addresses #5044 review #4212656411 comments 3174540081, 3174540082, and 3174540086. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
resolveDCRCredentialswas wired up and unit-tested in the prior sub-issue, butEmbeddedAuthServernever called it, so any upstream configured withDCRConfigstill booted without a client. This PR makes the resolver load-bearing and adds the operator-visible structured logs the Phase 2 spec calls for (cache-hit age, stale-registration warning, per-step errors, and a remediation hint when an upstream rejects a cached client as 4xx).EmbeddedAuthServernow owns an in-memoryDCRCredentialStoreand callsresolveDCRCredentials+consumeResolutionfor every OAuth2 upstream whose config requests DCR.buildPureOAuth2Configdeliberately keeps its pure, network-free signature; the DCR-resolvedClientSecretis layered on afterwards via a newapplyResolutionToOAuth2Confighelper so the two application sites stay side-by-side.RunConfigelement is shallow-copied and its OAuth2 sub-config is deep-copied before mutation, preserving the caller'sRunConfig.Upstreamsslice.resolveDCRCredentialsreturns a newdcrStepErrorwrapping the originating step (metadata discovery, DCR call, store read, store write) instead of logging-and-returning at each branch. The boundary caller (buildUpstreamConfigs) emits oneslog.ErrorvialogDCRStepError, so there is a single error line per failure. A panic inside the singleflight closure is converted to the samedcrStepErrorcarrying the captureddebug.Stack(), so the boundary log surfaces it without an in-defer duplicate.slog.Debugwithdcr_age_daysand additionallyslog.Warnwith remediation text when the cached registration exceeds the 90-daydcrStaleAgeThreshold./oauth/registerhandler success log emits atDebugwithissuer,client_id,software_id,token_endpoint_auth_method, andscopes— silent at INFO+ per the project's logging convention.SoftwareIDis threaded throughValidateDCRRequest, which now also caps it to 256 printable-ASCII characters.MonitoredTokenSourceacceptsupstreamandclient_idat construction (replacing the earlierSetUpstreamContextsetter, which had an unsynchronized writer). On the first transition toUnauthenticateddriven by a permanent token-endpoint 4xx, it emits a singleslog.Warnwith a DCR/CIMD remediation hint (gated bystopOnceso a tightToken()loop cannot spam it, and suppressed when noclient_idcontext is available). The classifier itself (isTransientNetworkError) is side-effect free.NewHandlernow fails loudly whenAuthorizationServerConfigor its embedded*fosite.Configis nil;issuer()no longer carries a runtime nil-guard.http(s)://case-insensitively per RFC 3986 §3.1, and trailing sentence punctuation around a URL is preserved so prose is not mangled.pkg/oauthproto/dcr.gohandleHTTPResponsecaps the upstream/registererror body at 8 KiB viaio.LimitReaderso a non-conformant or malicious upstream cannot inflate operator log volume.Closes #5039
Type of change
Test plan
task test)task lint-fix)New coverage added:
pkg/authserver/runner/embeddedauthserver_test.go— full DCR boot against a mock AS (metadata +/register), cache-hit short-circuit (zero additional HTTP requests on the secondbuildUpstreamConfigs), copy-before-mutate assertion on the caller'sRunConfig.Upstreamsslice, andTestNewEmbeddedAuthServer_DCRBootdriving the real constructor to assertdcrStoreis populated.pkg/authserver/runner/dcr_test.go— cache-hit Debug log withdcr_age_days, stale-age Warn above threshold, onedcrStepError-returning case per failure step, and the panic-recovery test now asserts the wrapped error is a*dcrStepError(dcrStepRegister, …, Stack: <captured>).pkg/authserver/server/handlers/handler_chain_test.go—TestHandler_issuercovers both populated and zero-valueAccessTokenIssuer.TestNewHandler_ErrorsOnNilConfig(inauthorize_test.go) pins the new constructor invariant.pkg/authserver/server/registration/dcr_test.go—software_idlength cap and non-printable-ASCII rejection.pkg/auth/monitored_token_source_test.go— permanent-4xx branch emits the remediation Warn withworkload/upstream/client_idpopulated via the new constructor parameters.pkg/runner/runner_test.go—TestRemoteAuthLogContextcovers the fullcached CIMD > cached DCR > staticprecedence.http.Client-style quoted URL, embeddeduser:pass@hostuserinfo (with and without query), implicit-flow#access_token=…fragments, and uppercaseHTTPS://schemes.Changes
pkg/authserver/runner/embeddedauthserver.godcrStorefield and wire resolver intobuildUpstreamConfigs; introduceapplyResolutionToOAuth2Config; document the per-instance store vs. process-wide singleflight asymmetry.pkg/authserver/runner/dcr.godcrStepError/logDCRStepError; emit cache-hit Debug, stale-age Warn, and per-step error context; carry panic stack on the wrapped error so a single boundary record surfaces it; URL-secret-stripping sanitizer (userinfo + query + fragment, case-insensitive scheme match).pkg/authserver/runner/dcr_store.gopkg/authserver/server/handlers/dcr.goDebug, drop redundantupstreamattribute, threadsoftware_idthrough.pkg/authserver/server/handlers/handler.goNewHandlervalidatesconfigandconfig.Config; remove the runtime nil-guard fromissuer().pkg/authserver/server/registration/dcr.gosoftware_idto 256 printable-ASCII characters; exportMaxSoftwareIDLength.pkg/auth/monitored_token_source.goupstream/client_idas constructor parameters; emit DCR/CIMD remediation Warn frommarkAsUnauthenticated(once per state transition, gated bystopOnce, suppressed whenclientID == ""). The classifier is side-effect free.pkg/runner/runner.goMonitoredTokenSourcefields fromRemoteAuthConfigwithcached CIMD > cached DCR > staticprecedence (mirrorsremote.Handler.resolveClientCredentials).pkg/oauthproto/dcr.go/registererror body at 8 KiB viaio.LimitReaderbefore embedding it in the returned error string.pkg/authserver/runner/embeddedauthserver_test.gopkg/authserver/runner/dcr_test.gopkg/authserver/server/handlers/handler_chain_test.go,authorize_test.goTestHandler_issuerandTestNewHandler_ErrorsOnNilConfig.pkg/authserver/server/handlers/dcr_test.go,pkg/authserver/server/registration/dcr_test.gosoftware_idvalidation.pkg/auth/monitored_token_source_test.go,pkg/runner/runner_test.goLarge PR Justification
The PR exceeds the 400-LOC / 10-file guideline (1342+/104- across 18 files), but the change cannot be split without leaving the codebase in an intermediate non-functional state:
*_test.go) — table-driven coverage for the resolver wiring, the new error taxonomy, the staleness + remediation logs, and the sanitizer.dcrStepErrortaxonomy + boundary logging, monitored-token-source remediation log, andsoftware_idvalidation. Splitting the resolver wiring without the matching error taxonomy would land a regression in observability; splitting the error taxonomy without the wiring would orphan the new types.Does this introduce a user-facing change?
Operators gain two new diagnostic signals on DCR-enabled upstreams:
Warnwhen a cached DCR registration is older than 90 days, hinting that the operator may want to rotate.Warnon the first transition toUnauthenticateddriven by a permanent token-endpoint 4xx, hinting that the workload's cached DCR or CIMD credentials may be stale and that deleting them and restarting will trigger re-registration. The Warn fires at most once per state transition and is suppressed entirely for workloads with noclient_idcontext.The
MonitoredTokenSourceconstructor signature has changed to requireupstreamandclient_id, but it is an internal API with no public surface, and the priorSetUpstreamContextsetter is removed.Successful DCR registrations log at
Debug, notInfo, in keeping with the project's silent-success-at-INFO+ convention.Special notes for reviewers
buildPureOAuth2Configis load-bearing: it takes nocontext.Contextand performs no network I/O. DCR resolution is deliberately sequenced around it, not inside it, so the pure config builder stays trivially testable. Please flag any future change that threadsctxinto it.dcrStepErrormoves the "one error line per failure" invariant from the resolver to the boundary caller. If a future caller forgets to invokelogDCRStepError, the error will still propagate normally but the structured log will be missing. Worth keeping in mind if we add more resolver call sites.consumeResolution(formerlyapplyResolution) is a one-shot state transition, not an idempotent default-fill: a second call after the first is a no-op only because the first clearedDCRConfig. The companionapplyResolutionToOAuth2Configwrites the inline-onlyClientSecretonto the built*upstream.OAuth2Config; any future output path fromOAuth2UpstreamRunConfigtoupstream.OAuth2Configmust call both helpers to get a fully-resolved DCR client.dcrFlightsingleflight is intentionally process-wide whileEmbeddedAuthServer.dcrStoreis per-instance — see the doc comments on each. The asymmetry is load-bearing for concurrentEmbeddedAuthServerinstances in the same process.Generated with Claude Code